
249 research outputs found

Metagenome and Metatranscriptome Analyses Using Protein Family Profiles

Author: Edlund Anna
McLean Jeffrey S.
Yang Youngik
Yooseph Shibu
Zhong Cuncong
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/07/2016

Analyses of metagenome data (MG) and metatranscriptome data (MT) are often challenged by a paucity of complete reference genome sequences and the uneven/low sequencing depth of the constituent organisms in the microbial community, which respectively limit the power of reference-based alignment and de novo sequence assembly. These limitations make accurate protein family classification and abundance estimation challenging, which in turn hampers downstream analyses such as abundance profiling of metabolic pathways, identification of differentially encoded/expressed genes, and de novo reconstruction of complete gene and protein sequences from the protein family of interest. The profile hidden Markov model (HMM) framework enables the construction of very useful probabilistic models for protein families that allow for accurate modeling of position specific matches, insertions, and deletions. We present a novel homology detection algorithm that integrates a banded Viterbi algorithm for profile HMM parsing with an iterative simultaneous alignment and assembly computational framework. The algorithm searches a given profile HMM of a protein family against a database of fragmentary MG/MT sequencing data and simultaneously assembles complete or near-complete gene and protein sequences of the protein family. The resulting program, HMM-GRASPx, demonstrates superior performance in aligning and assembling homologs when benchmarked on both simulated marine MG and real human saliva MG datasets. On real supragingival plaque and stool MG datasets that were generated from healthy individuals, HMM-GRASPx accurately estimates the abundances of the antimicrobial resistance (AMR) gene families and enables accurate characterization of the resistome profiles of these microbial communities. For real human oral microbiome MT datasets, using the HMM-GRASPx estimated transcript abundances significantly improves detection of differentially expressed (DE) genes. Finally, HMM-GRASPx was used to reconstruct comprehensive sets of complete or near-complete protein and nucleotide sequences for the query protein families. HMM-GRASPx is freely available online from http://sourceforge.net/projects/hmm-graspx

KU ScholarWorks

Directory of Open Access Journals

PubMed Central

FigShare

Evolution of allostery in the cyclic nucleotide binding module

Author: Anand Ganesh S
Kannan Natarajan
Neuwald Andrew F
Taylor Susan S
Venter J Craig
Wu Jian
Yooseph Shibu
Publication venue: BioMed Central
Publication date: 01/01/2007

Analysis of cyclic nucleotide binding (CNB) domains shows that they have evolved to sense a wide variety of second messenger signals; a mechanism for allosteric regulation by CNB domains is proposed

Crossref

Springer - Publisher Connector

PubMed Central

ScholarBank@NUS

A versatile palindromic amphipathic repeat coding sequence horizontally distributed among diverse bacterial and eucaryotic microbes

Author: Calcutt Michael J
Foecking Mark F
Glass John I
Röske Kerstin
Wise Kim S
Yooseph Shibu
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Intragenic tandem repeats occur throughout all domains of life and impart functional and structural variability to diverse translation products. Repeat proteins confer distinctive surface phenotypes to many unicellular organisms, including those with minimal genomes such as the wall-less bacterial monoderms, Mollicutes. One such repeat pattern in this clade is distributed in a manner suggesting its exchange by horizontal gene transfer (HGT). Expanding genome sequence databases reveal the pattern in a widening range of bacteria, and recently among eucaryotic microbes. We examined the genomic flux and consequences of the motif by determining its distribution, predicted structural features and association with membrane-targeted proteins. Results: Using a refined hidden Markov model, we document a 25-residue protein sequence motif tandemly arrayed in variable-number repeats in ORFs lacking assigned functions. It appears sporadically in unicellular microbes from disparate bacterial and eucaryotic clades, representing diverse lifestyles and ecological niches that include host parasitic, marine and extreme environments. Tracts of the repeats predict a malleable configuration of recurring domains, with conserved hydrophobic residues forming an amphipathic secondary structure in which hydrophilic residues endow extensive sequence variation. Many ORFs with these domains also have membrane-targeting sequences that predict assorted topologies; others may comprise reservoirs of sequence variants. We demonstrate expressed variants among surface lipoproteins that distinguish closely related animal pathogens belonging to a subgroup of the Mollicutes. DNA sequences encoding the tandem domains display dyad symmetry. Moreover, in some taxa the domains occur in ORFs selectively associated with mobile elements. These features, a punctate phylogenetic distribution, and different patterns of dispersal in genomes of related taxa, suggest that the repeat may be disseminated by HGT and intra-genomic shuffling. Conclusions: We describe novel features of PARCELs (Palindromic Amphipathic Repeat Coding ELements), a set of widely distributed repeat protein domains and coding sequences that were likely acquired through HGT by diverse unicellular microbes, further mobilized and diversified within genomes, and co-opted for expression in the membrane proteome of some taxa. Disseminated by multiple gene-centric vehicles, ORFs harboring these elements enhance accessory gene pools as part of the "mobilome" connecting genomes of various clades, in taxa sharing common niches.

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Global ecotypes in the ubiquitous marine clade SAR86

Author: Arnosti C.
Dupont C.L.
Hoarfrost A.
Ladau J.
Nayfach S.
Pollard K.S.
Yooseph S.
Publication venue: Springer Nature
Publication date: 01/01/2020

SAR86 is an abundant and ubiquitous heterotroph in the surface ocean that plays a central role in the function of marine ecosystems. We hypothesized that despite its ubiquity, different SAR86 subgroups may be endemic to specific ocean regions and functionally specialized for unique marine environments. However, the global biogeographical distributions of SAR86 genes, and the manner in which these distributions correlate with marine environments, have not been investigated. We quantified SAR86 gene content across globally distributed metagenomic samples and modeled these gene distributions as a function of 51 environmental variables. We identified five distinct clusters of genes within the SAR86 pangenome, each with a unique geographic distribution associated with specific environmental characteristics. Gene clusters are characterized by the strong taxonomic enrichment of distinct SAR86 genomes and partial assemblies, as well as differential enrichment of certain functional groups, suggesting differing functional and ecological roles of SAR86 ecotypes. We then leveraged our models and high-resolution, remote sensing-derived environmental data to predict the distributions of SAR86 gene clusters across the world’s oceans, creating global maps of SAR86 ecotype distributions. Our results reveal that SAR86 exhibits previously unknown, complex biogeography, and provide a framework for exploring geographic distributions of genetic diversity from other microbial clades

Carolina Digital Repository

ProtoNet 6.0: organizing 10 million protein sequences in a compact hierarchical family tree

Author: A. Stern
Attwood
Bairoch
Brown
Bru
Fleischmann
Henikoff
Kaplan
Loewenstein
M. Linial
N. Linial
N. Rappoport
Pearl
S. Karsenty
Sasson
Watson
Wu
Yooseph
Publication venue: Oxford University Press

ProtoNet 6.0 (http://www.protonet.cs.huji.ac.il) is a data structure of protein families that cover the protein sequence space. These families are generated through an unsupervised bottom–up clustering algorithm. This algorithm organizes large sets of proteins in a hierarchical tree that yields high-quality protein families. The 2012 ProtoNet (Version 6.0) tree includes over 9 million proteins, of which 5.5% come from UniProtKB/SwissProt and the rest from UniProtKB/TrEMBL. The hierarchical tree structure is based on an all-against-all comparison of 2.5 million representatives of UniRef50. Rigorous annotation-based quality tests prune the tree to the 162,088 most informative clusters. Every high-quality cluster is assigned a ProtoName that reflects the most significant annotations of its proteins. These annotations are dominated by GO terms, UniProt/Swiss-Prot keywords and InterPro. ProtoNet 6.0 operates in a default mode. When used in the advanced mode, this data structure offers the user a view of the family tree at any desired level of resolution. Systematic comparisons with previous versions of ProtoNet are carried out. They show how our view of protein families evolves as larger parts of the sequence space become known. ProtoNet 6.0 provides numerous tools to navigate the hierarchy of clusters

Crossref

PubMed Central

Data growth and its impact on the SCOP database: new developments

Author: A. Andreeva
A. G. Murzin
Altschul
Andreeva
Berman
C. Chothia
Chandonia
D. Howorth
Finn
J.-M. Chandonia
Lo Conte
Moroz
Murzin
S. E. Brenner
T. J. P. Hubbard
Wheeler
Yooseph
Publication venue: Oxford University Press
Publication date: 13/11/2007

The Structural Classification of Proteins (SCOP) database is a comprehensive ordering of all proteins of known structure, according to their evolutionary and structural relationships. The SCOP hierarchy comprises the following levels: Species, Protein, Family, Superfamily, Fold and Class. While keeping the original classification scheme intact, we have changed the production of SCOP in order to cope with a rapid growth of new structural data and to facilitate the discovery of new protein relationships. We describe ongoing developments and new features implemented in SCOP. A new update protocol supports batch classification of new protein structures by their detected relationships at Family and Superfamily levels in contrast to our previous sequential handling of new structural data by release date. We introduce pre-SCOP, a preview of the SCOP developmental version that enables earlier access to the information on new relationships. We also discuss the impact of worldwide Structural Genomics initiatives, which are producing new protein structures at an increasing rate, on the rates of discovery and growth of protein families and superfamilies. SCOP can be accessed at http://scop.mrc-lmb.cam.ac.uk/scop

King's Research Portal

Analysis and comparison of very large metagenomes with fast clustering and functional annotation

Author: AC McHardy
AR Quinlan
B Rodriguez-Brito
D Sheskin
DB Rusch
DC Richter
DH Huson
E Portugaly
EA Dinsdale
EF DeLong
FE Angly
GW Tyson
H Noguchi
H Teeling
J Shendure
JC Venter
K Mavromatis
KJ Hoff
L Krause
PD Schloss
R Seshadri
RK Aziz
S Yooseph
SF Altschul
SG Tringe
SR Eddy
SR Gill
W Li
Weizhong Li
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: The remarkable advance of metagenomics presents significant new challenges in data analysis. Metagenomic datasets (metagenomes) are large collections of sequencing reads from anonymous species within particular environments. Computational analyses for very large metagenomes are extremely time-consuming, and there are often many novel sequences in these metagenomes that are not fully utilized. The number of available metagenomes is rapidly increasing, so fast and efficient metagenome comparison methods are in great demand. Results: The new metagenomic data analysis method Rapid Analysis of Multiple Metagenomes with a Clustering and Annotation Pipeline (RAMMCAP) was developed using an ultra-fast sequence clustering algorithm, fast protein family annotation tools, and a novel statistical metagenome comparison method that employs a unique graphic interface. RAMMCAP processes extremely large datasets with only moderate computational effort. It identifies raw read clusters and protein clusters that may include novel gene families, and compares metagenomes using clusters or functional annotations calculated by RAMMCAP. In this study, RAMMCAP was applied to the two largest available metagenomic collections, the "Global Ocean Sampling" and the "Metagenomic Profiling of Nine Biomes". Conclusion: RAMMCAP is a very fast method that can cluster and annotate one million metagenomic reads in only hundreds of CPU hours. It is available from http://tools.camera.calit2.net/camera/rammcap/

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Microbiome Analysis of Stool Samples from African Americans with Colon Polyps

Author: Ashktorab H.
Brim H.
Laiyemo A.O.
Lee E.
Nelson K.
Shokrani B.
Torralbo M.
Yooseph S.
Zoetendal E.G.
Publication date: 01/01/2013

Background: Colonic polyps are common tumors occurring in approximately 50% of Western populations, with an approximately 10% risk of malignant progression. Dietary agents have been considered the primary environmental exposure to promote colorectal cancer (CRC) development. However, the colonic mucosa is permanently in contact with the microbiota and its metabolic products, including toxins that also have the potential to trigger oncogenic transformation. Aim: To analyze fecal DNA for microbiota composition and functional potential in African Americans with pre-neoplastic lesions. Materials & Methods: We analyzed the bacterial composition of stool samples from 6 healthy individuals and 6 patients with colon polyps using a 16S ribosomal RNA-based phylogenetic microarray, the Human Intestinal Tract Chip (HITChip), and 16S rRNA gene barcoded 454 pyrosequencing. The functional potential was determined by sequence-based metagenomics using 454 pyrosequencing. Results: Fecal microbiota profiling of samples from the healthy individuals and polyp patients using both the phylogenetic microarray (HITChip) and barcoded 454 pyrosequencing generated similar results. A distinction between the two sets of samples was obtained only when the analysis was performed at the sub-genus level. Most of the species leading to the dissociation were from the Bacteroides group. The metagenomic analysis did not reveal major differences in bacterial gene prevalence/abundances between the two groups, even when the analysis and comparisons were restricted to available Bacteroides genomes. Conclusion: This study reveals that at the pre-neoplastic stages, there is a trend showing microbiota changes between healthy individuals and colon polyp patients at the sub-genus level. These differences were not reflected at the genome/functions levels. Bacteria and associated functions within the Bacteroides group need to be further analyzed and dissected in a large sample to pinpoint potential actors in the early colon oncogenic transformation

Directory of Open Access Journals

Wageningen University & Research Publications

FigShare